PDF to JSON: How to Extract Structured Data from PDFs
Three practical approaches to extracting structured data from PDFs into JSON: regex on raw text, template-based extraction, and AI-powered extraction, with code for each.
8 posts
Build an automated invoice processing pipeline that turns raw transaction data into branded PDF invoices. Complete working example with HTML template and API integration.
A practical guide to parsing PDFs for retrieval-augmented generation. Covers chunking strategies, PyMuPDF vs Marker vs LlamaParse, and code for extracting and embedding PDF content.
A practical guide to PDF OCR: how to check if a PDF actually needs OCR, Tesseract vs cloud APIs, and when you should skip OCR entirely by generating PDFs with real text layers.
A complete tutorial for building a Python document pipeline that queries a database, formats data with Jinja2, generates PDFs via API, and delivers them via email or S3.
A hands-on comparison of five ways to extract tables from PDFs in Python: pdfplumber, Camelot, Tabula, AWS Textract, and manual regex. With code, benchmarks, and honest pros and cons.
A head-to-head comparison of Kreuzberg, PyMuPDF, and pdfplumber for Python PDF parsing. Benchmarks, architecture differences, and code examples to help you pick the right tool.
A practical guide to extracting text from PDFs in Python. Covers PyMuPDF, pdfplumber, and when you should skip extraction entirely and just generate a new PDF.